BUG: make read_csv be able to read large floating numbers into float #62542
pandas/_libs/parsers.pyx
Outdated
if na_filter and kh_get_str_starts_item(na_hashset, word):
    continue

if self.parser.decimal in word or b"e" in word or b"E" in word:
- Do we know if a word like "NaN" would reach this point?
- Is it naive to try float(word) to check if word is a float?
> Is it naive to try float(word) to check if word is a float?
I don't think it will work, because integer strings also convert to floats, so float(word) cannot tell the two apart.
> Do we know if a word like "NaN" would reach this point?
I added a print statement before the return True and ran the test suite in pandas/tests/io/parser. The word "NaN" didn't reach it. The words that flagged the column as float are shown below.
It is unintended that strings containing the letter "e" flag the column as "float", but since this function only serves to skip int parsing, the result isn't that problematic.
Marking as float because of b'0.0'
Marking as float because of b' 0.0000'
Marking as float because of b' 0.0100'
Marking as float because of b'0.056674973'
Marking as float because of b' 0,1 '
Marking as float because of b'0,1'
Marking as float because of b'0.1'
Marking as float because of b'01:10:18.300'
Marking as float because of b'0.18905338179353307'
Marking as float because of b'0.2'
Marking as float because of b'0.212036'
Marking as float because of b'0.2140'
Marking as float because of b'0.2616121342493164'
Marking as float because of b'0.31814660061537436'
Marking as float because of b'0355626618.16711'
Marking as float because of b'-0.364216805298'
Marking as float because of b'-0.41306354339189344'
Marking as float because of b'0.43263079080478717'
Marking as float because of b'0.5'
Marking as float because of b'0.5165781941249967'
Marking as float because of b'-0.5227484414807474'
Marking as float because of b'-0.689265'
Marking as float because of b'-0.692787'
Marking as float because of b' 0.8100'
Marking as float because of b'0.980268513777'
Marking as float because of b' ,1 '
Marking as float because of b' -,1 '
Marking as float because of b' _1, '
Marking as float because of b' _1,_ '
Marking as float because of b' 1, '
Marking as float because of b' 1_, '
Marking as float because of b',1'
Marking as float because of b'-,1'
Marking as float because of b'_1,'
Marking as float because of b'_1,_'
Marking as float because of b'1,'
Marking as float because of b'1.'
Marking as float because of b'1_,'
Marking as float because of b' -1,0 '
Marking as float because of b'-1,0'
Marking as float because of b'1.0'
Marking as float because of b'10,'
Marking as float because of b'10.'
Marking as float because of b' 1_000,000_000 '
Marking as float because of b'1_000,000_000'
Marking as float because of b'1.00361'
Marking as float because of b'1032.43'
Marking as float because of b'1.050000000000000044408921'
Marking as float because of b'10E-100000'
Marking as float because of b'10E-617'
Marking as float because of b'10E-99999999999999999'
Marking as float because of b'10E-999999999999999999'
Marking as float because of b'10E999999999999999999'
Marking as float because of b'1.1'
Marking as float because of b'1.100000000000000088817842'
Marking as float because of b'1.12551'
Marking as float because of b'1.149999999999999911182158'
Marking as float because of b'-1.15973806169'
Marking as float because of b'1.199999999999999955591079'
Marking as float because of b' ,1__2 '
Marking as float because of b' --1,2 '
Marking as float because of b' 1,_2 '
Marking as float because of b',1__2'
Marking as float because of b'--1,2'
Marking as float because of b'1,_2'
Marking as float because of b'1.2'
Marking as float because of b'1.200'
Marking as float because of b' 1,2_1 '
Marking as float because of b'1,2_1'
Marking as float because of b' 1,2,2 '
Marking as float because of b'1,2,2'
Marking as float because of b' 1_234,56 '
Marking as float because of b'1_234,56'
Marking as float because of b'12345,67'
Marking as float because of b' 1_234,56e0 '
Marking as float because of b'1_234,56e0'
Marking as float because of b'1234E+0'
Marking as float because of b'1.25'
Marking as float because of b' -1,2e0 '
Marking as float because of b'-1,2e0'
Marking as float because of b' 1,2e_1 '
Marking as float because of b'1,2e_1'
Marking as float because of b' 1,2E-1 '
Marking as float because of b' 1,2E1 '
Marking as float because of b'1,2E-1'
Marking as float because of b'1,2E1'
Marking as float because of b' 1,2e1_0 '
Marking as float because of b'1,2e1_0'
Marking as float because of b' 1,2e-10e1 '
Marking as float because of b'1,2e-10e1'
Marking as float because of b'1.300000000000000044408921'
Marking as float because of b'1.350000000000000088817842'
Marking as float because of b'1352171357E+5'
Marking as float because of b'1.399999999999999911182158'
Marking as float because of b'1.4'
Marking as float because of b'1.449999999999999955591079'
Marking as float because of b'14.7674'
Marking as float because of b'1.5'
Marking as float because of b'1521,1541'
Marking as float because of b'1.550000000000000044408921'
Marking as float because of b'1.600000000000000088817842'
Marking as float because of b'1.649999999999999911182158'
Marking as float because of b'1.700000000000000177635684'
Marking as float because of b'1.75'
Marking as float because of b'179.71425'
Marking as float because of b'1.800000000000000044408921'
Marking as float because of b' 18446744073709551616.0'
Marking as float because of b' 18446744073709551616.5'
Marking as float because of b'1.850000000000000088817842'
Marking as float because of b'187101,9543'
Marking as float because of b'1.899999999999999911182158'
Marking as float because of b'1917.09447'
Marking as float because of b'1.950000000000000177635684'
Marking as float because of b' 1a_2,1 '
Marking as float because of b'1a_2,1'
Marking as float because of b' ,1e '
Marking as float because of b' -,1e '
Marking as float because of b',1e'
Marking as float because of b'-,1e'
Marking as float because of b'1E'
Marking as float because of b' +1,e0 '
Marking as float because of b' -1,e0 '
Marking as float because of b'+1,e0'
Marking as float because of b'-1,e0'
Marking as float because of b' +1e+0 '
Marking as float because of b' +1e0 '
Marking as float because of b' -_1e0 '
Marking as float because of b' -1e0 '
Marking as float because of b' _1e0 '
Marking as float because of b'+1e+0'
Marking as float because of b'+1e0'
Marking as float because of b'-_1e0'
Marking as float because of b'-1e0'
Marking as float because of b'_1e0'
Marking as float because of b' +,1e1 '
Marking as float because of b' +1e-1 '
Marking as float because of b' -,1e1 '
Marking as float because of b'+,1e1'
Marking as float because of b'+1e-1'
Marking as float because of b'-,1e1'
Marking as float because of b' 1e11,2 '
Marking as float because of b'1e11,2'
Marking as float because of b' 1,e1_2 '
Marking as float because of b'1,e1_2'
Marking as float because of b'2.'
Marking as float because of b'2.0'
Marking as float because of b'2.2'
Marking as float because of b' 2.2100'
Marking as float because of b'225.874'
Marking as float because of b'2,334.01'
Marking as float because of b'2.334,01'
Marking as float because of b'240.000'
Marking as float because of b'243.164'
Marking as float because of b'2456026.548822908'
Marking as float because of b'2.5'
Marking as float because of b'252.373'
Marking as float because of b' 260.0000'
Marking as float because of b' 280.0000'
Marking as float because of b' 2.8100'
Marking as float because of b'2e'
Marking as float because of b'3.'
Marking as float because of b'314.11625'
Marking as float because of b' 32.0'
Marking as float because of b' 32e0'
Marking as float because of b' 3.2e1'
Marking as float because of b' 3.2e-80'
Marking as float because of b' 3.2e80'
Marking as float because of b'3.3000000000000003'
Marking as float because of b'330.65659'
Marking as float because of b'3.4'
Marking as float because of b'344.98370'
Marking as float because of b'3.5'
Marking as float because of b'3.68573087906'
Marking as float because of b' 36893488147419103232.3'
Marking as float because of b'3E'
Marking as float because of b'4.'
Marking as float because of b'412.166'
Marking as float because of b'41.605'
Marking as float because of b'42e'
Marking as float because of b'4.5'
Marking as float because of b'45.'
Marking as float because of b'45e-1'
Marking as float because of b'4,738797819'
Marking as float because of b'4.8'
Marking as float because of b'5.'
Marking as float because of b'5.1'
Marking as float because of b'632E'
Marking as float because of b'64.0'
Marking as float because of b'65248E10'
Marking as float because of b' .67'
Marking as float because of b'70.06056'
Marking as float because of b' 7.2000'
Marking as float because of b'73.48821'
Marking as float because of b'7.5'
Marking as float because of b' .78'
Marking as float because of b'80.000'
Marking as float because of b' .81'
Marking as float because of b'.86'
Marking as float because of b' .88'
Marking as float because of b'-9,1'
Marking as float because of b'apple'
Marking as float because of b'DEF'
Marking as float because of b'e'
Marking as float because of b' e11,2 '
Marking as float because of b'e11,2'
Marking as float because of b'e,d'
Marking as float because of b'EEE'
Marking as float because of b'e\n d'
Marking as float because of b'example\n sentence\n two'
Marking as float because of b'False'
Marking as float because of b' hello'
Marking as float because of b'hello'
Marking as float because of b'hello\nthere'
Marking as float because of b'"hello world"'
Marking as float because of b'http://www.ikea.com/se/sv/catalog/categories/departments/living_room/10475/?se%7cps%7cnonbranded%7cvardagsrum%7cgoogle%7ctv_bord'
Marking as float because of b'Hugo Chavez'
Marking as float because of b'Hugo Ch\xc3\xa1vez'
Marking as float because of b'Hugo Ch\xc3\xa1vez Fr\xc3\xadas'
Marking as float because of b'Hugo Rafael Chavez Frias'
Marking as float because of b'index'
Marking as float because of b'index1'
Marking as float because of b'Iris-setosa'
Marking as float because of b'King of New York (1990)'
Marking as float because of b"line '21' line 22"
Marking as float because of b"line '21\n' line 22"
Marking as float because of b'line 21\nline 22'
Marking as float because of b"line '21\n' \r\tline 22"
Marking as float because of b"line \n'21' line 22"
Marking as float because of b'None'
Marking as float because of b'one'
Marking as float because of b'President'
Marking as float because of b'qwer'
Marking as float because of b'Raphael'
Marking as float because of b'rectangular'
Marking as float because of b'red'
Marking as float because of b'rez'
Marking as float because of b'SELL'
Marking as float because of b'Sixth Man, The (1997)'
Marking as float because of b'SLAGBORD, "Bergslagen", IKEA:s 1700-tals series'
Marking as float because of b'somedatasomedatasomedata1'
Marking as float because of b'tables'
Marking as float because of b' test'
Marking as float because of b'test'
Marking as float because of b'test \x1a test'
Marking as float because of b'True'
Marking as float because of b'TRUE'
Marking as float because of b'Venezuela'
Marking as float because of b'\xe3\x81\x9d\xe3\x81\xae\xe7\xb6\x9a\xe7\xb7\xa8\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b\xe3\x80\x8e\xe6\x8c\x87\xe8\xbc\xaa\xe7\x89\xa9\xe8\xaa\x9e\xe3\x80\x8f\xe3\x81\xab\xe3\x81\x8a\xe3\x81\x84\xe3\x81\xa6\xe3\x81\xaf\xe3\x80\x8c\xe4\xb8\x80\xe3\x81\xa4\xe3\x81\xae\xe6\x8c\x87\xe8\xbc\xaa\xef\xbc\x88the One Ring\xef\xbc\x89\xe3\x80\x8d\xe3\x81\xae\xe4\xbd\x9c\xe3\x82\x8a\xe4\xb8\xbb\xe3\x80\x81\xe3\x80\x8c\xe5\x86\xa5\xe7\x8e\x8b\xef\xbc\x88Dark Lord\xef\xbc\x89\xe3\x80\x8d\xe3\x80\x81\xe3\x80\x8c\xe3\x81\x8b\xe3\x81\xae\xe8\x80\x85\xef\xbc\x88the One\xef\xbc\x89[1]\xe3\x80\x8d\xe3\x81\xa8\xe3\x81\x97\xe3\x81\xa6\xe7\x99\xbb\xe5\xa0\xb4\xe3\x81\x99\xe3\x82\x8b\xe3\x80\x82\xe5\x89\x8d\xe5\x8f\xb2\xe3\x81\xab\xe3\x81\x82\xe3\x81\x9f\xe3\x82\x8b\xe3\x80\x8e\xe3\x82\xb7\xe3\x83\xab\xe3\x83\x9e\xe3\x83\xaa\xe3\x83\xab\xe3\x81\xae\xe7\x89\xa9\xe8\xaa\x9e\xe3\x80\x8f\xe3\x81\xa7\xe3\x81\xaf\xe3\x80\x81\xe5\x88\x9d\xe4\xbb\xa3\xe3\x81\xae\xe5\x86\xa5\xe7\x8e\x8b\xe3\x83\xa2\xe3\x83\xab\xe3\x82\xb4\xe3\x82\xb9\xe3\x81\xae\xe6\x9c\x80\xe3\x82\x82\xe5\x8a\x9b\xe3\x81\x82\xe3\x82\x8b\xe5\x81\xb4\xe8\xbf\x91\xe3\x81\xa7\xe3\x81\x82\xe3\x81\xa3\xe3\x81\x9f\xe3\x80\x82'
Marking as float because of b'YES'
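To make the two points above concrete, here is a rough sketch in plain Python (not the Cython code from the PR). `looks_like_float` is a hypothetical name that mirrors the condition in the diff, and its `decimal` argument stands in for `self.parser.decimal`.

```python
def looks_like_float(word: bytes, decimal: bytes = b".") -> bool:
    """Substring heuristic mirroring the condition shown in the diff."""
    return decimal in word or b"e" in word or b"E" in word


# float() accepts integer text, so on its own it cannot separate ints from floats.
assert float("123") == 123.0

# The heuristic flags genuine floating-point words...
assert looks_like_float(b"18446744073709551616.0")
assert looks_like_float(b"10E-617")
assert looks_like_float(b"0,1", decimal=b",")

# ...but also any non-numeric word that happens to contain "e" or "E".
assert looks_like_float(b"apple")
assert looks_like_float(b"TRUE")

# Plain integers are left for the integer parser.
assert not looks_like_float(b"123")
```

This matches the log: words such as b'apple' and b'TRUE' are flagged only because of the letter "e"/"E", which is harmless as long as the flag is used solely to skip integer parsing.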
I will update the verification to make it return early if it finds a word that isn't numeric.
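A minimal sketch of what such an early-return check could look like, written in plain Python rather than Cython. The function name `may_be_float` and the exact branch order are assumptions for illustration; only the character sets (`" +-"`, the digits, `"eE"`) and the `found_first_digit` flag come from the diff fragments in this thread.

```python
IGNORED_CHARS = b" +-"
DIGITS = b"0123456789"
FLOAT_INDICATING_CHARS = b"eE"


def may_be_float(word: bytes, decimal: bytes = b".") -> bool:
    """Return True if `word` looks numeric and contains a float marker."""
    dec = decimal[0]
    found_first_digit = False
    for ch in word:  # iterating over bytes yields ints
        if ch in IGNORED_CHARS:
            continue
        if ch == dec:
            return True
        if not found_first_digit:
            if ch not in DIGITS:
                # word isn't numeric: stop scanning right away
                return False
            found_first_digit = True
        elif ch in FLOAT_INDICATING_CHARS:
            return True
    return False
```

With this shape, b"apple" and b"hello" are rejected on their first byte instead of being flagged because they contain an "e", while b"1e5" and b"18446744073709551616.0" are still flagged.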
Co-authored-by: Matthew Roeschke
Here are the benchmarks on io.csv comparing the latest changes (without 380e6ec) against main. The worst performance hit was an increase from 1.43 ms to 4.04 ms on
Verification performance overhead reduced from 2.83x to 2.47x
I managed to reduce the performance overhead of the verification a little bit. If the current solution is not satisfactory, I think it's possible to change the
pandas/_libs/parsers.pyx
Outdated
ignored_chars = b" +-"
digits = b"0123456789"
float_indicating_chars = b"eE"
Is it worth typing these in the cdef block?
I moved them, and the test suite in io/parser/common passed. I don't recall how it was transpiled before, but I checked the generated code with the new changes and it's using native C types.
pandas/_libs/parsers.pyx
Outdated
found_first_digit = False
j = 0
while word[j] != b"\0":
Same with this terminating char here?
I moved the null byte definition to cdef.
const char *ignored_chars = " +-"
const char *digits = "0123456789"
const char *float_indicating_chars = "eE"
char null_byte = 0
Had to use = 0 instead of = '\0' because cython-lint doesn't let me use single quotes, and it doesn't compile with double quotes.
Thanks @Alvaro-Kothe
elif not found_first_digit and word[j] not in digits:
    # word isn't numeric
    return False
elif not found_first_digit and word[j] in digits:
<ctype.h> provides an isdigit function that would be preferable to use here, rather than rolling our own implementation
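As a loose Python analogy for this suggestion (not the C `isdigit` itself), the built-in `bytes.isdigit` gives the same answer as a hand-rolled membership test against a digit string for every possible byte value. `is_ascii_digit` is a hypothetical helper name used only for this sketch.

```python
DIGITS = b"0123456789"


def is_ascii_digit(ch: int) -> bool:
    """Library-based check: bytes.isdigit is ASCII-only for bytes objects."""
    return bytes([ch]).isdigit()


# The hand-rolled `ch in DIGITS` test and the library check agree on all 256 byte values.
assert all(is_ascii_digit(c) == (c in DIGITS) for c in range(256))
```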
There's also strtof in <stdlib.h> that you could use in lieu of most of this code
Any chance we've run the benchmarks for CSV reading? It might be expensive to take a second pass over columns like this
I ran them on my system and posted the results on #62542 (comment). The performance penalty was 2.5x slower from commit b3519c1. @WillAyd I will open another PR to reduce the performance penalty caused by these changes.
I think that the most performant approach is to check whether the word is a float while it is being parsed as an integer in
Very nice - thanks for the thorough analysis @Alvaro-Kothe. I agree that it's probably best to do this during tokenization, although I'm a bit wary of anything that is going to add performance overhead. Floats beyond 2^64 are going to suffer from precision loss, so it doesn't seem like general performance should take a step back for that
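A quick sketch of the precision-loss point, assuming the usual IEEE 754 double (53-bit significand) that Python's `float` uses on common platforms; `math.ulp` requires Python 3.9 or newer.

```python
import math
import sys

# A double carries 53 significant bits on IEEE 754 platforms.
assert sys.float_info.mant_dig == 53

big = 2**64

# Neighbouring integers around 2**64 collapse to the same float.
assert float(big) == float(big + 1)
assert float("18446744073709551616.0") == float("18446744073709551617.0")

# The gap between adjacent representable floats at 2**64 is 4096.
assert math.ulp(float(big)) == 4096.0
```

So a column of values above 2^64 that is parsed as float keeps the magnitude but not every integer digit.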
Understood. This performance overhead for a big floating number edge case may not be worth it. I will prototype a tokenization solution to see if the performance overhead is negligible compared to what it was before. If it isn't, I will prepare a PR to revert this one.
read_csv() fails to detect floats larger than 2.0^64 #51295
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Notes
pandas/tests/io/parser/common/test_ints.py).
Implementation details
Skips the integer parsing if it detects a potential floating number by checking for the presence of: